107 research outputs found

    Significant Subgraph Mining with Multiple Testing Correction

    The problem of finding itemsets that are statistically significantly enriched in a class of transactions is complicated by the need to correct for multiple hypothesis testing. Pruning untestable hypotheses was recently proposed as a strategy for this task of significant itemset mining. On real-world datasets, it was shown to yield greater statistical power (that is, the discovery of more truly significant itemsets) than the standard Bonferroni correction. An open question, however, is whether this strategy of excluding untestable hypotheses also leads to greater statistical power in subgraph mining, in which the number of hypotheses is much larger than in itemset mining. Here we answer this question through an empirical investigation on eight popular graph benchmark datasets. We propose a new efficient search strategy, which always returns the same solution as the state-of-the-art approach and is approximately two orders of magnitude faster. Moreover, we exploit the dependence between subgraphs by considering the effective number of tests and thereby further increase the statistical power. Comment: 18 pages, 5 figures, accepted to the 2015 SIAM International Conference on Data Mining (SDM15).
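As a rough illustration of what pruning untestable hypotheses can look like, the sketch below assumes a one-sided Fisher's exact test and a Tarone-style correction factor. Neither is named in the abstract above, and the function names `min_p_value` and `tarone_factor` are hypothetical. The idea: a pattern that occurs in x of n transactions, n1 of which belong to the class of interest, has a smallest attainable p-value that depends only on these counts. Patterns whose smallest attainable p-value exceeds the corrected threshold can never become significant, so they need not count toward the correction.

```python
from math import comb


def min_p_value(x, n1, n):
    """Smallest one-sided Fisher p-value for a pattern with support x.

    Assumes n transactions, n1 of them in the class of interest. The most
    extreme contingency table places as many occurrences as possible in
    that class, so the tail probability reduces to a single term.
    """
    a_max = min(x, n1)
    return comb(n1, a_max) * comb(n - n1, x - a_max) / comb(n, x)


def tarone_factor(min_p_values, alpha=0.05):
    """Smallest k such that at most k hypotheses are testable at alpha / k."""
    for k in range(1, len(min_p_values) + 1):
        testable = sum(1 for p in min_p_values if p <= alpha / k)
        if testable <= k:
            return k
    # Only reached for an empty input list.
    return max(1, len(min_p_values))


# Four candidate patterns in a dataset of 10 transactions, 5 per class.
supports = [1, 1, 5, 4]
p_mins = [min_p_value(x, 5, 10) for x in supports]
factor = tarone_factor(p_mins)  # compare with the Bonferroni factor len(supports)
```

Under these assumptions the correction factor can be much smaller than the total number of patterns, which is where the gain in statistical power would come from.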

    Graph Kernels

    We present a unified framework to study graph kernels, special cases of which include the random walk (Gärtner et al., 2003; Borgwardt et al., 2005) and marginalized (Kashima et al., 2003, 2004; Mahé et al., 2004) graph kernels. Through reduction to a Sylvester equation we improve the time complexity of kernel computation between unlabeled graphs with n vertices from O(n^6) to O(n^3). We find a spectral decomposition approach even more efficient when computing entire kernel matrices. For labeled graphs we develop conjugate gradient and fixed-point methods that take O(dn^3) time per iteration, where d is the size of the label set. By extending the necessary linear algebra to Reproducing Kernel Hilbert Spaces (RKHS) we obtain the same result for d-dimensional edge kernels, and O(n^4) in the infinite-dimensional case; on sparse graphs these algorithms only take O(n^2) time per iteration in all cases. Experiments on graphs from bioinformatics and other application domains show that these techniques can speed up computation of the kernel by an order of magnitude or more. We also show that certain rational kernels (Cortes et al., 2002, 2003, 2004), when specialized to graphs, reduce to our random walk graph kernel. Finally, we relate our framework to R-convolution kernels (Haussler, 1999) and provide a kernel that is close to the optimal assignment kernel of Fröhlich et al. (2006) yet provably positive semi-definite.
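To make the fixed-point idea concrete, here is a minimal sketch of a geometric random walk kernel between two unlabeled graphs. It is a simplified illustration, not the authors' implementation: it assumes uniform starting and stopping distributions and a decay parameter `lam` small enough for the iteration to converge, and the names `random_walk_kernel`, `matmul` and `transpose` are hypothetical. The point it shows is that the linear system on the product graph can be iterated as X = P + lam * A1 X A2^T, so the n^2 by n^2 Kronecker product is never formed and each iteration costs O(n^3).

```python
def matmul(a, b):
    """Dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*a)]


def random_walk_kernel(a1, a2, lam=0.1, tol=1e-10, max_iter=1000):
    """Geometric random walk kernel via fixed-point iteration.

    a1, a2 are adjacency matrices. Convergence requires lam to be smaller
    than 1 / (rho(a1) * rho(a2)), where rho is the spectral radius.
    """
    n1, n2 = len(a1), len(a2)
    p = 1.0 / (n1 * n2)          # uniform start (and stop) probability
    a2_t = transpose(a2)
    x = [[p] * n2 for _ in range(n1)]
    for _ in range(max_iter):
        ax = matmul(matmul(a1, x), a2_t)   # product-graph step, O(n^3)
        new_x = [[p + lam * ax[i][j] for j in range(n2)] for i in range(n1)]
        delta = max(abs(new_x[i][j] - x[i][j])
                    for i in range(n1) for j in range(n2))
        x = new_x
        if delta < tol:
            break
    # Weight the solution by the uniform stopping distribution.
    return p * sum(sum(row) for row in x)


edge = [[0, 1], [1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
k_edge = random_walk_kernel(edge, edge)
k_mixed = random_walk_kernel(triangle, path)
```

For two single-edge graphs every product-graph vertex has degree 1, so the fixed point has the closed form 0.25 / (1 - lam), which gives a convenient sanity check.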

    Efficient network-guided multi-locus association mapping with graph cuts

    As an increasing number of genome-wide association studies reveal the limitations of attempting to explain phenotypic heritability by single genetic loci, there is growing interest in associating complex phenotypes with sets of genetic loci. While several methods for multi-locus mapping have been proposed, it is often unclear how to relate the detected loci to the growing knowledge about gene pathways and networks. The few methods that take biological pathways or networks into account are either restricted to investigating a limited number of predetermined sets of loci, or do not scale to genome-wide settings. We present SConES, a new efficient method to discover sets of genetic loci that are maximally associated with a phenotype, while being connected in an underlying network. Our approach is based on a minimum cut reformulation of the problem of selecting features under sparsity and connectivity constraints that can be solved exactly and rapidly. SConES outperforms state-of-the-art competitors in terms of runtime, scales to hundreds of thousands of genetic loci, and exhibits higher power in detecting causal SNPs in simulation studies than existing methods. On flowering time phenotypes and genotypes from Arabidopsis thaliana, SConES detects loci that enable accurate phenotype prediction and that are supported by the literature. Matlab code for SConES is available at http://webdav.tuebingen.mpg.de/u/karsten/Forschung/scones/ Comment: 20 pages, 6 figures, accepted at ISMB (International Conference on Intelligent Systems for Molecular Biology) 201

    In silico phenotyping via co-training for improved phenotype prediction from genotype

    Motivation: Predicting disease phenotypes from genotypes is a key challenge in medical applications in the postgenomic era. Large training datasets of patients that have been both genotyped and phenotyped are the key requisite when aiming for high prediction accuracy. With current genotyping projects producing genetic data for hundreds of thousands of patients, large-scale phenotyping has become the bottleneck in disease phenotype prediction. Results: Here we present an approach for imputing missing disease phenotypes given the genotype of a patient. Our approach is based on co-training, which predicts the phenotype of unlabeled patients based on a second class of information, e.g. clinical health record information. Augmenting training datasets by this type of in silico phenotyping can lead to significant improvements in prediction accuracy. We demonstrate this on a dataset of patients with two diagnostic types of migraine, termed migraine with aura and migraine without aura, from the International Headache Genetics Consortium. Conclusions: Imputing missing disease phenotypes for patients via co-training leads to larger training datasets and improved prediction accuracy in phenotype prediction. Availability and implementation: The code can be obtained at: http://www.bsse.ethz.ch/mlcb/research/bioinformatics-and-computational-biology/co-training.html Contact: [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

    RNA atlas of human bacterial pathogens uncovers stress dynamics linked to infection

    Bacterial processes necessary for adaption to stressful host environments are potential targets for new antimicrobials. Here, we report large-scale transcriptomic analyses of 32 human bacterial pathogens grown under 11 stress conditions mimicking human host environments. The potential relevance of the in vitro stress conditions and responses is supported by comparisons with available in vivo transcriptomes of clinically important pathogens. Calculation of a probability score enables comparative cross-microbial analyses of the stress responses, revealing common and unique regulatory responses to different stresses, as well as overlapping processes participating in different stress responses. We identify conserved and species-specific 'universal stress responders', that is, genes showing altered expression in multiple stress conditions. Non-coding RNAs are involved in a substantial proportion of the responses. The data are collected in a freely available, interactive online resource (PATHOgenex). Bacterial stress responses are potential targets for new antimicrobials. Here, Avican et al. present global transcriptomes for 32 bacterial pathogens grown under 11 stress conditions, and identify common and unique regulatory responses, as well as processes participating in different stress responses. Peer reviewed.

    Prediction of recovery from multiple organ dysfunction syndrome in pediatric sepsis patients.

    MOTIVATION: Sepsis is a leading cause of death and disability in children globally, accounting for ∼3 million childhood deaths per year. In pediatric sepsis patients, the multiple organ dysfunction syndrome (MODS) is considered a significant risk factor for adverse clinical outcomes characterized by high mortality and morbidity in the pediatric intensive care unit. The rapidly growing availability of electronic health records (EHRs) has allowed researchers to develop data-driven approaches such as machine learning in healthcare, with considerable success. However, effective machine learning models that can accurately predict, at an early stage, the recovery of pediatric sepsis patients from MODS to a mild state, and thus assist clinicians in decision-making, are still lacking. RESULTS: This study develops a machine learning-based approach to predict the recovery from MODS to zero or single organ dysfunction 1 week in advance in the Swiss Pediatric Sepsis Study (SPSS) cohort of children with blood-culture confirmed bacteremia. Our model achieves internal validation performance on the SPSS cohort with an area under the receiver operating characteristic curve (AUROC) of 79.1% and an area under the precision-recall curve (AUPRC) of 73.6%. It was also externally validated on another pediatric sepsis patient cohort collected in the USA, yielding an AUROC of 76.4% and an AUPRC of 72.4%. These results indicate that our model has the potential to be integrated into EHR systems and to contribute to patient assessment and triage in pediatric sepsis patient care. AVAILABILITY AND IMPLEMENTATION: Code available at https://github.com/BorgwardtLab/MODS-recovery. The data underlying this article are not publicly available, to protect the privacy of the individuals who participated in the study. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

    A Kernel Two-Sample Test

    We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
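The quadratic-time statistic mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: it uses a Gaussian kernel with a fixed bandwidth `sigma` (the abstract does not fix a kernel), computes the biased estimate of the squared MMD, and leaves out the test thresholds entirely. The function names `gaussian_kernel` and `mmd2_biased` are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_biased(xs, ys, sigma=1.0):
    """Biased quadratic-time estimate of the squared MMD between two samples.

    Averages the kernel within each sample and across the two samples:
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
    """
    m, n = len(xs), len(ys)
    k_xx = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xs) / (m * m)
    k_yy = sum(gaussian_kernel(a, b, sigma) for a in ys for b in ys) / (n * n)
    k_xy = sum(gaussian_kernel(a, b, sigma) for a in xs for b in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


# Two samples near 0 and one sample near 3 (one-dimensional points).
near_a = [[0.0], [0.1], [-0.1]]
near_b = [[0.05], [-0.05], [0.0]]
far = [[3.0], [3.1], [2.9]]
close_value = mmd2_biased(near_a, near_b)
far_value = mmd2_biased(near_a, far)
```

Samples from similar distributions should give a value near zero, while clearly separated samples give a larger one. Turning the statistic into a test additionally needs a rejection threshold, for example from the bounds or asymptotic distribution described in the abstract.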